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SUMMARY 

Improving efficiency of importance sampler is at the center of research on Monte 
Carlo methods. While adaptive approach is usually difficult within the Markov Chain 
Monte Carlo framework, the counterpart in importance sampling can be justified and 
validated easily. We propose an iterative adaptation method for learning the pro- 
posal distribution of an importance sampler based on stochastic approximation. The 
stochastic approximation method can recruit general iterative optimization techniques 
like the minorization-maximization algorithm. The effectiveness of the approach in 
optimizing the Kullback divergence between the proposal distribution and the target 
is demonstrated using several simple examples. 

Some key words: Adaptive algorithm; Importance sampling; Stochastic approxima- 
tion. 
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1 INTRODUCTION 

In this paper we are concerned with the approximation of the integral 

h(x)n(x)dx = E- K h(X) 



where 7r is a density with respect to the Lebesgue measure on the Euclidean space. 
When we can sample directly from tt, the simplest Monte Carlo approach for eval- 
uating this integral is to draw independent samples Xi,i = 1,...,N from 7r and 
approximate the integral by the sample mean of h(Xi). When direct sampling is 
infeasible, the importance sampling approach comes to the rescue by first drawing 
independent samples from a proposal density /, and then use the weighted average 

N 

X 



— K x i) w i 



i=l 



to approximate the integral, where Wi = yn^y- This estimator is unbiased since 
Ef[h(X)j£l] = E W X. It is well known that the efficiency of importance sampling 
depends crucially on the choice of the proposal distribution, since the variance of the 
estimator is V ar f(h{X)ir{X) / f (X)) / N . It is obvious that we can achieve smallest 
variance with / oc | /z, | vr , but this is almost useless in practice. 

The idea of adapting the proposal distribution by utilizing the previously sam- 
pled data is a powerful one. While some schemes f or adaptation have been devised 



200 ll ). it is gener- 



under the Markov Chain Monte Carlo framework (lHaario et al. 
ally acknowledged that designing a valid scheme in this situation is a complicated 
matter and must be carried out carefully in order not to disturb the detailed bal- 
ance equation. It turns out however that adaptation in importance sampling is much 
simpler and almost no worries about the validity arise when using changing propos- 



als. This is demonstrated convincingly in ICappe et al. 



(120041 ) . Basically, the change 



of the proposal can be almost arbitrarily dependent upon previous samples due to 
the canceling effect on the proposal distribution. Adaptation within the importance 
sampling framework is generally valid, although genuine effect of adaptation does not 



always come about. 



Daucetal 



(120071 ) showed that a simplistic adaptation scheme 
within the mixture family of proposal distributions does not achieve the desired effect 
and the mixture weights stabilize into the uniform weights, while a Rao-Blackwellized 
version correctly converges to the optimal proposal. Their approach works by drawing 
N samples from the current proposal, and updating the proposal based on this pop- 
ulation. The asymptotic theory is based on the limit theorems when the population 
size N goes to infinity, while the number of iterations is kept small in practice. 

We propose an alternative scheme based on stochastic approximation, by con- 
sidering a parametric family of proposal distributions. The goal is to update the 
parameters sequentially to achieve the effect of adaptation. In the next section, we 
present our approach and state the convergence result based on simplified but strin- 
gent assumptions. In section [31 we illustrate the method by detailing some concrete 
circumstances under which our approach works. We also present some simulation 
results using some simple examples to demonstrate the adaptation ability of our ap- 
proach. This paper concludes with section HI 



2 STOCHASTIC APPROXIMATION WITH 



WEIGHTED SAMPLES 



Similar to iDouc et al.l (120071 ). we focus on adaptation with respect to it and use the 
Kullback divergence 

( \ i n ( x )j 
ir{x) log — ; ax 



/(*) 

as our criterion for efficiency Minimizing the divergence is equivalent to maximizing 
the 7r-expected log-likelihood E n \ogf(X). For the proposal distributions, we consider 
a parametric family of densities 



F = {f(-\6),6eQcR 1J }. 



Our goal is find 9 that makes E n log f(X\9) as large as possible. We assume the 
maximum is achieved by some 9* G for simplicity. Maximizing a known function 
of 9 is usually done by some Newton-like algorithms. We assume such an algorithm 
exists for maximizing the sampled version of E n \og f(X\9), 5^ log/(Xj|#). Almost 
all such algorithms can be directly extended to the weighted sample case. Such an 
algorithm defines a mapping 6* t+1 = M(9 t ), which implicitly depend on the (weighted) 
samples. We use M to denote both the algorithm and the mapping defined by the 
algorithm. The notation M(9; {Xi,Wi\i ) is also used to emphasize the dependence 
of the mapping on the weighted samples. 

For our purpose, suppose we have such an algorithm M(9) at hand, which im- 
plicitly depends on the weighted samples (Xi, Wx), . . . , (Xjv, wn). The stochastic 
approximation importance sampling algorithm is as follows: 

- Start with an initial value 9n. 



- For t = 0, 1, 



Draw samples Xf , . . . , X l N from f(-\9 t ). 

Run the algorithm M on the weighted samples (X{, w{), . . . , (X^, w^) with 
j^) {Xj) to obtain 9 t+1 = M{9 t ). 



Update 9 t+ i = 9 t + -ft(@t+i — where {74} is a sequence of decreasing 
positive numbers chosen for the algorithm to converge. 



The convergenc e of the stochastic a p proximation algorithm was stud i ed by 



thors, including 



Delyon et al. 



(|1999h : 



Jaakkola et al 



J199J); 



Tsitsiklisl f ll994h : 



many au- 



Kushner fc Yin 



(119971 ). We choose to work with a maybe overly simplified version here for clarity to 



avoid any excessive burden in the part of the readers. 



Theorem 1 ( Convergence of stochastic approximation) Let M(9) = limjv^oo M(9; {Xj, Wi}) 
be the almost sure limit when the number of weighted samples drawn from f(-\9) and 
weighted against the target tt goes to infinity. Assume (1) 9 t is contained in a convex 



and compact subset W ofQ; (2) 1(9; x) = log f(x\8) is continuously differentiable as 
a function of 9 and dgE 7T [l(9; X)] = E 7r [dgl(9;X)]; (3) there exists a unique maxi- 
mizer 9* of E n [log f(X\9)} inside W , which is also the unique stationary point; (4) 
Et=o7* = °°>E t =o7t 2 < 00 / ( 5 ) (E*dol(6;X),M(9) - 9) < 0, for all 9 eW, and it 
is zero only if 9 = 9* . 

Then 9 t converges to 9* with probability 1 as T goes to infinity. 



he t heorem is actually a much simplified version of Theorem 2 in 



Delyon et al 



(119991 ) . with E w [ l(9;X)] acting as th e Lyapunov function V in that theorem. The 



reader can check 



Delyon et al. 



(119991 ) for the proof and other discussions that greatly 
relaxed the different assumptions presented here, including the possibility of conver- 
gence to other stationary points if they exist. We choose to ignore the complications 
caused by local maximum or the unfortunate case where the parameters approach 
the boundary of the parameters sp ace and become unst able. The discussion for the 



latter concern can also be found in 



Andrieu et al. 



(120051). 



3 ILLUSTRATION AND SIMULATION 

In this section, we present some concrete instantiations of our general approach de- 
scribed above. We assume all conditions expect condition (5) above are satisfied and 
will only verify this condition in each case. 

Exponential family We consider the one-dimensional exponential family {f(x\ji) = 
exp{rj(fi)x — 0(/i)} : /x G 0}. Note we choose the mean parameterization so that 
Ef(X) = /i, and 9 = \i in this case. The parameter that achieves the minimum 
divergence is obviously 9* = E n X. In this simplest case, we do not need to resort to 
numerical optimization procedure given weighted samples {(X{,w\), . . . , (X* N , w 1 ^)}. 
We can directly estimate M(9 t ) = jjJ2Zi w t x h with M(9 t ) = 9*. That is, the 
optimal value of E n l(6\X) is reached in one single iteration in the "noiseless" case. 
With M(9 t ) — 9 t = 9* — 9 t , condition (5) in this case is verified by the concavity of 
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log/(a;|0) in 9. The iteration with TV = 1 is 

which is just a simple weighted stochastic approximation procedure to find the ex- 
pected value under ir. 

Stochastic approximation with MM algorithm Given a target function a(9) : = 
^2 i ^ogf(X i \9), MM algorithm is a general technique for iteratively finding the local 
maximum. In our context, MM stands for minorization-maximization. This algorithm 
works for weighted version a(6) := ^ Ji w i \og f(Xi\9) also. Given weighted samples 
{(X{, w\), . . . , (Xjj, w f N )} with current parameter 9 t , MM algorithm first finds a func- 
tion Q(-\9 t ) such that 

a(0)=5>?to g /(x?|0) > om), 

i 

a{e t ) = Y J ^ogf{X t l \e t ) = Q(9 t \9 t ) 

i 

The function Q is called the minorizing function of a at the point Of The new 
parameter 9 t+ \ is then chosen to be the maximizer of Q(9\9 t ). The MM algo- 
rithm increases the target function monotonically in each iteration since a(6 t +i) > 
Q(Qt+i\@t) ^ QiP t \9 t ) = a{9 t ). MM algorithm gives us a way to define M given 
the weighted samples. The corresponding "noiseless" iteration is defined by M(9 t ) = 
arg max -E/(.|6» t ) [Q($|^)] , where the expectation is taken over the independent weighted 
samples drawn from f(-\0 t ). To verify condition (5), we make the simplifying as- 
sumption that EfQ(9\9 t ) is a continuously different iable and strictly concave function. 
Since EfQ(9\9 t ) is concave and M(9 t ) is its maximizer, we have (d e E fQ(9\9 t )\e=e t -, M(9 t ) — 
9t) < 0. Note that unlike the previous case, here we do not assume log / itself to be 
concave, otherwise it is hard to justify the use of MM algorithm. By taking expec- 
tations, it is easily verified that E^.\Q t ^Q{9\9 t ) is a minorizing function of E n l(9;X) 
at 9 t . Which immediately implies d e E w l(9; X) = d e Q(9\9 t ) at 9 t , and the condition is 
verified. 
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Stochastic approximation with Mixture models We consider the family of mixtures 
models (X)dLi ^dPrfO^) : S a d = 1> a d > 0}' "where Pd(-) is fixed and only the mixing 
weights are adaptable. So 9 = (ai, . . . , Od_l) in this case. It is well known that the 
EM algorithm which is popular for mixture models is a special case of MM algorithm. 
Given the weighted sample {(X*, w\), . . . , (X l N , w^)} with current weight parameters 
9 t , the new parameter is updated as 

This gives our iterative algorithm M, with [M(9)] d = g T £ tt ^ where [0] d is the d- 
th component of the vector 9. There is an important difference between mixture model 
inferences using EM and mixture model used as proposal distribution though. For 
mixture model inferences using EM, generally the component indicator is unknown, 
and not included in the observations. In importance sampling, the weighted data 
is something we generate from the current proposal distribution. It is thus possible 
to take advantage of this and the iterative update can be made using [M(9 t )]d = 
Yli w tI{Zi — d)/N where Z\ is the mixture component index of Xj, which gives the 
same M as before. 

We perform some simple simulations to demonstrate the effectiveness of our adap- 
tation scheme. In all our examples, we choose 7 n oc 1/n although slower decreasing 
sequence may result in faster convergence rate. We only consider the approximation 
of J ir(x)dx = 1. In the first example, we let 7r be the standard normal density, and 
consider proposal distributions from the normal distribution family with adaptable 
mean and fixed variance of 1. The number of samples generated from current pro- 
posal is N = 1, and the recursion ([1]) is used for stochastic approximation. In the 
second example, we use the same 7r but the proposal distribution is chosen from the 
Cauchy family with mean and adaptable scale a. In each iteration we draw two 
weighted samples (using only one sample will make o = 0) and use MM algorithm 
with a quadratic lower bound. The second derivative of a(cr 2 ) = 5^ i=1 2 Wi log /(Xj|er) 
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is 

wi + w 2 _ 2a 2 Xf + Xf WX + W2 

~~ 2^ ^ Wi (a 2 + Xf) 2 a 4 ~ 

i=l,2 v % ' 

In practice, we assume a is inside a bounded interval so that the second derivative 
above can be bounded below by a negative constant C. Thus a(cx 2 ) is minorized by 
a(a 2 ) + (a 2 — of )a'(af) + i(a 2 — of) 2 ■ C, and the update becomes 

°ii = - 7tC7-V((7?) 

This is just a simple stochastic Newton-like update. Note C~ l can be absorbed into 
It- 

In the third example, we choose n ~ §iV(-l, 1) + §iV(2, 1), with the proposal 
family {aiV(-l, 1) + (1 - a)iV(2, 1)}. The iterations start from a = 1/2 with N = l 
in each iteration, using the update equation (j2j) where we do not explicitly use the 
cluster identities. 

For all three examples, we compare the mean square error of adaptive important 
sampling with that of using a fixed proposal distribution. The approximation vt to 
the integral can be updated using 

v t +i =v t + Jtj/j w\/N - v t ). 

i 

The mean squared errors are reported in table [Husing 1000 replications with 500 sim- 
ulated samples (i.e., 500 iterations for the first and the third examples, 250 iterations 
in the second example). The fixed proposal distribution is chosen to be the same as 
the initial distribution in the adaptive sampling, which is normal with mean 1 for 
the first example, Cauchy with scale o = 2 in the second example, and mixture of 
normal with mixing weights (1/2, 1/2) in the third example. For each of the three 
cases, another fixed proposal distribution is chosen to make the mean squared error 
close to that of adaptive sampling, which is normal with mean 0.1 in the first case, 
Cauchy with scale 1.1 in the second case, and mixture model with mixing weights 
(0.35, 0.65) in the third case. This is denoted by "fixed proposal 2" in the table. For 
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Table 1: Mean squared errors for the three examples. 



Example 1 Example 2 Example 3 

adaptive sampling 0.0000088 0.00036 0.0000032 

fixed proposal 1 0.0003543 0.00112 0.0000846 

fixed proposal 2 0.0000099 0.00056 0.0000038 

the mixture family example, we also show in Figured] the density 7r together with the 
initial proposal density and the proposal density after 100 iterations. 



4 DISCUSSION 



We have proposed an iterative adaption scheme for importance sampler based on 
stochastic approximation and demonstrated its effectiveness. Our ex amples only 



i nvolv e independent sampler although kernel- like propos als as used in 



Douc et al. 



DoucetaL 



(120071 ). where 



( 120071 ) can also be considered. Contrary to the method of 
the asymptotics are studied in which the population size N goes to infinity within 
each iteration, our method works by letting the number of iterations diverge. Other 
types of asymptotic s besides the convergence of the algorithm could be studies as in 



Delyon et al. 



(Il999l ) although this is not our main concern here. 
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Figure 1: Proposal density after adaptation. 
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